2018 Edition

Estimating Financial Risk through Monte Carlo Simulation

Risk analysis is part of every decision we make when faced with uncertainty, ambiguity, and variability. Indeed, even though we have unprecedented access to information, we can't accurately predict the future. In finance, there is a fair amount of uncertainty and risk involved with estimating the future value of financial products, due to the wide variety of potential outcomes. Monte Carlo simulation (also known as the Monte Carlo Method) allows inspecting many possible outcomes of the decision making process, and can be used to assess the impact of risk: this, in turns, allows for better decision-making under uncertainty.

Goals

The main objectives we set for this Notebook are as follows:

  1. Develop fundamental knowledge about Risk analysis
  2. Understand Monte Carlo Simulation (MCS)
  3. Apply Monte Carlo Simulation for predicting risk

Steps

  1. First, in section 1, we introduce the basics of MCS
  2. In section 2, we work on a simple example to where we apply the MCS method
  3. In section 3, we briefly summarize the main characteristics of the Monte Carlo Simulation (MCS) technique
  4. In section 4, we overview the common distributions which are often used in MCS
  5. In section 5, we work on a real use case, that focuses on estimating financial risk. We will use techniques such as featurization (that is, generating additional features to improve model accuracy), linear regression, kernel density estimation, sampling distributions and so on ...

Reference

This Notebook is inspired by Chapter 9 of the book Advanced Analytics with Spark by Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. It is strongly suggested to read this Chapter to get a general idea of the topic of this Notebook.

1. Introduction

1.1. Monte Carlo Simulation (MCS)

Monte Carlo simulation is a computerized mathematical technique that can be applied such that it is possible to account for risk in quantitative analysis and decision making. This technique is used in many different fields, such as R&D, risk management, portfolio management, pricing derivatives, strategic planning, project planning, cost modeling and many more.

In general, MCS is a technique that "converts" uncertainty on input variables of a model into probability distributions. By combining the distributions and randomly selecting values from them, it recalculates the simulated model many times, to determine the probability of the output.

Historically, this technique was first used by scientists working on the atomic bomb: it was named after Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.

1.2. How does it work?

Monte Carlo simulation performs risk analysis by building models of possible results by substituting a range of possible input values, that constitute uncertainty, into a statistical distribution. It then computes possible outcomes repeatedly, each time using a different set of random values from the probability functions that "model" the input. Depending upon the number of random input variables and their distribution, a Monte Carlo simulation could involve thousands or tens of thousands of "rounds" before it is complete. When complete, Monte Carlo simulation produces distributions of possible outcome values.

By using probability distributions instead of actual input samples, it is possible to model more accurately uncertainty: different choices of distributions will yield different outputs.

2. Illustrative example

Imagine you are the marketing manager for a firm that is planning to introduce a new product. You need to estimate the first-year net profit from this product, which might depend on:

  • Sales volume in units
  • Price per unit (also called "Selling price")
  • Unit cost
  • Fixed costs

Net profit will be calculated as $Net Profit = Sales Volume* (Selling Price - Unit cost) - Fixed costs$. Fixed costs (accounting for various overheads, advertising budget, etc.) are known to be \$ 120,000, which we assume to be deterministic. All other factors, instead, involve some uncertainty: sales volume (in units) can cover quite a large range, the selling price per unit will depend on competitor actions, which are hard to predict, and unit costs will also vary depending on vendor prices and production experience, for example.

Now, to build a risk analysis model, we must first identify the uncertain variables -- which are essentially random variables. While there's some uncertainty in almost all variables in a business model, we want to focus on variables where the range of values is significant.

In [65]:
from datetime import datetime, timedelta
from itertools import islice
from os import listdir
from os.path import isfile, join
from scipy import stats
from IPython.display import display
from sklearn.metrics import mean_squared_error
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.graphics.gofplots import qqplot_2samples
%matplotlib inline
import numpy as np
import statsmodels.api as sm
import random
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import seaborn as sns
import math
from sklearn.mixture import GaussianMixture

plt.style.use(['default', 'bmh'])

base_folder = "monte-carlo-risk/"
stock_folder = base_folder + 'stocks'
factors_folder= base_folder + "factors/"

2.1. Unit sales and unit price

Based on a hypothetical market research you have done, you have beliefs that there are equal chances for the market to be slow, normal, or hot:

  • In a "slow" market, you expect to sell 50,000 units at an average selling price of \$11.00 per unit
  • In a "normal" market, you expect to sell 75,000 units, but you'll likely realize a lower average selling price of \$10.00 per unit
  • In a "hot" market, you expect to sell 100,000 units, but this will bring in competitors, who will drive down the average selling price to \$8.00 per unit

Question 1

Calculate the average units and the unit price that you expect to sell, which depend on the market state. Use the assumptions above to compute the expected quantity of products and their expected unit price.
In [66]:
average_unit = (50000 + 75000 + 100000) / 3
average_price = (11.0 + 10.0 + 8.0) / 3

print("average unit:", average_unit)
print("average_price:", average_price)
average unit: 75000.0
average_price: 9.666666666666666

2.2. Unit Cost

Another uncertain variable is Unit Cost. In our illustrative example, we assume that your firm's production manager advises you that unit costs may be anywhere from \$5.50 to \$7.50, with a most likely expected cost of \$6.50. In this case, the most likely cost can be considered as the average cost.

2.3. A Flawed Model: using averages to represent our random variables

Our next step is to identify uncertain functions -- also called functions of a random variable. Recall that Net Profit is calculated as $Net Profit = Sales Volume * (Selling Price - Unit cost) - Fixed costs$. However, Sales Volume, Selling Price and Unit Cost are all uncertain variables, so Net Profit is an uncertain function.

The simplest model to predict the Net Profit is using average of sales volume, average of selling price and average of unit cost for calculating. So, if only consider averages, we can say that the $Net Profit = 75,000*(9.66666666 - 6.5) - 120,000 \sim 117,500$.

However, as Dr. Sam Savage warns, "Plans based on average assumptions will be wrong on average." The calculated result is far from the actual value: indeed, the true average Net Profit is roughly \$93,000, as we will see later in the example.

Question 2

Question 2.1

Write a function named `calNetProfit` to calculate the Net Profit using the average of sales volume, the average of selling price and the average of unit cost.
In [67]:
def calNetProfit(average_unit, average_price, average_unitcost, fixed_cost):
    return average_unit * (average_price - average_unitcost) - fixed_cost

average_unitcost = 6.5
fixed_cost = 120000
NetProfit = calNetProfit(average_unit, average_price, average_unitcost, fixed_cost)
print("Net profit:", NetProfit)
Net profit: 117499.99999999994

Question 2.2

Verify the warning message of Dr. Sam Savage by calculating the error of our estimated Net Profit using averages only. Recall that the true value is roughly \$93,000, so we are interested in:
    $$ error = \frac{your\_value - true\_value}{true\_value}$$
      Note also we are interested in displaying the error as a percentage.
        Looking at the error we make, do you think that we can use the current model that only relies on averages?
        In [68]:
        trueNetProfit = 93000
        error = (NetProfit - trueNetProfit) / (trueNetProfit)
        print("Error in percentage:", error * 100, " %")
        
        Error in percentage: 26.344086021505316  %
        
        COMMENT
        The error is very high as expected. Using intuition to plug the average value of our uncertain inputs (Sales Volume, Selling Price, and Unit Cost) into the model in order to produce the average value of the output (Net Profit) is wrong and misleading. In fact, this reminds us of Jensen's inequality, which states that:

        $\varphi(E(X)) \leq E(\varphi(X)) $

        where $ E $ means taking expectation over a random variable $ X $ with any distribution, and $ \varphi $ is a convex function.

        Meaning that, when nonlinear relationships are involved, the value based on average assumptions is not the average value of the entity. Therefore, we cannot rely on this method for critical decisions and we cannot use the current model that only relies on averages.

        2.4. Using the Monte Carlo Simulation method to improve our model

        As discussed before, the selling price and selling volume both depend on the state of the market scenario (slow/normal/hot). So, the net profit is the result of two random variables: market scenario (which in turn determines sales volumes and selling price) and unit cost.

        Now, let's assume (this is an a-priori assumption we make) that market scenario follows a discrete, uniform distribution and that unit cost also follows a uniform distribution. Then, we can compute directly the values for selling price and selling volumes based on the outcome of the random variable market scenario, as shown in Section 2.1.

        From these a-priori distributions, in each run (or trial) of our Monte Carlo simulation, we can generate the sample value for each random variable and use it to calculate the Net Profit. The more simulation runs, the more accurate our results will be. For example, if we run the simulation 100,000 times, the average net profit will amount to roughly \$92,600. Every time we run the simulation, a different prediction will be output: the average of such predictions will consistently be less than \$117,500, which we predicted using averages only.

        Note also that in this simple example, we generate values for the market scenario and unit cost independently: we consider them to be independent random variables. This means that the eventual (and realistic!) correlation between the market scenario and unit cost variables is ignored. Later, we will learn how to be more precise and account for dependency between random variables.

        Question 3

        Question 3.1

        Write a function named `get_sales_volume_price` that returns the sales volume and price based on the market scenario. In particular, the scenario can get one of three values:
        • 0: Slow market
        • 1: Normal market
        • 2: Hot market
        The return value is a tuple in the form: `(sales_volume, price)`
        In [69]:
        # Get sales volume and  price based on market scenario
        # the function returns a tuple of (sales_volume, price)
        def get_sales_volume_price(scenario):
            # Slow market
            if scenario == 0:
                return (50000, 11)
            # Normal market
            if scenario == 1:
                return (75000, 10)    
            # Hot market
            if scenario == 2:
                return (100000, 8)    
        

        Question 3.2

        Run 100,000 Monte Carlo simulations and calculate the average net profit they produce. Then, compare the result to the "average model" we used in the previous questions (the one we called "flawed" model). Put your comments about the discrepancies between a simplistic model, and the more accurate MCS approach.
          Note that in each iteration, the `unit_cost` and `market_scenario` are generated according to their distributions. Also, recall what we have seen in Section 2.2: your firm account manager helped you with some research, to determine the variability of your random variables.
          HINT

          Function uniform(a,b) in module random generates a number $a<=c<=b$, which is drawn from a uniform distribution.

          Function randint(a,b) helps you generating an integer number $a<=c<=b$

          In [70]:
          trueNetProfit = 92600
          num_simulation = 100000
          netProfitAverages, errors = [], []
          
          netProfit = 0
          for i in range(0,num_simulation):
              unit_cost = random.uniform(5.5, 7.5)
              market_scenario = random.randint(0, 2)
              sales_volume, price = get_sales_volume_price(market_scenario)
              netProfit += calNetProfit(sales_volume, price, unit_cost, fixed_cost)
              
              netProfitAverages.append(netProfit / (i+1))
              errors.append(100 * abs(netProfitAverages[-1] - trueNetProfit) / (trueNetProfit))
          
          print("Average net profit:", netProfitAverages[-1])
          print("Error in percentage:", errors[-1], " %")
          
          Average net profit: 92143.9704866483
          Error in percentage: 0.49247247662171334  %
          
          COMMENT
          The error here is very low ($ \simeq 15 \% $), comparing to the previous model based on averages where the error was very high ($ \simeq 26 \% $). However, we ignore here the correlation between the market scenario and unit cost variables.
          The error changes from an execution to another (but still lower than $ \simeq 0.5 \% $). Let's plot the evolution of both the average net profit and the error through simulations:
          In [71]:
          plt.figure(figsize=(20,10))
          plt.subplot(1, 2, 1)
          plt.plot(netProfitAverages)
          plt.axhline(y=trueNetProfit, color='red', linestyle='dotted', linewidth=3)
          plt.yticks(list(plt.yticks()[0]) + [trueNetProfit])
          plt.xlabel("Number of simulations")
          plt.ylabel("Average net profit")
          plt.grid(linestyle='--', axis="y")
          
          plt.subplot(1, 2, 2)
          plt.plot(errors)
          plt.xlabel("Number of simulations")
          plt.ylabel("Error")
          plt.grid(linestyle='--', axis="y")
          plt.show()
          
          As we can see, the number of simulations we chose made the average net profit converge to the true net profit value ($92600$) and the error to $0$. We can see also that the average net profit and the error both converge after only a few simulations (comparing to the number of simulation we chose).

          3. A brief summary of the Monte Carlo Simulation (MCS) technique

          • A MCS allows several inputs to be used at the same time to compute the probability distribution of one or more outputs
          • Different types of probability distributions can be assigned to the inputs of the model, depending on any a-priori information that is available. When the distribution is completely unknown, a common technique is to use a distribution computed by finding the best fit to the data you have
          • The MCS method is also called a stochastic method because it uses random variables. Note also that the general assumption is for input random variables to be independent from each other. When this is not the case, there are techniques to account for correlation between random variables.
          • A MCS generates the output as a range instead of a fixed value and shows how likely the output value is to occur in that range. In other words, the model outputs a probability distribution.

          4. Common distributions used in MCS

          In what follows, we summarize the most common probability distributions that are used as a-priori distributions for input random variables:

          • Normal/Gaussian Distribution: this is a continuous distribution applied in situations where the mean and the standard deviation of a given input variable are given, and the mean represents the most probable value of the variable. In other words, values "near" the mean are most likely to occur. This is symmetric distribution, and it is not bounded in its co-domain. It is very often used to describe natural phenomena, such as people’s heights, inflation rates, energy prices, and so on and so forth. An illustration of a normal distribution is given below: normal_distribution

          • Lognormal Distribution: this is a distribution which is appropriate for variables taking values in the range $[0, \infty]$. Values are positively skewed, not symmetric like a normal distribution. Examples of variables described by some lognormal distributions include, for example, real estate property values, stock prices, and oil reserves. An illustration of a lognormal distribution is given below: log_normal_distribution

          • Triangular Distribution: this is a continuous distribution with fixed minimum and maximum values. It is bounded by the minimum and maximum values and can be either symmetrical (the most probable value = mean = median) or asymmetrical. Values around the most likely value (e.g. the mean) are more likely to occur. Variables that could be described by a triangular distribution include, for example, past sales history per unit of time and inventory levels. An illustration of a triangular distribution is given below:

          • Uniform Distribution: this is a continuous distribution bounded by known minimum and maximum values. In contrast to the triangular distribution, the likelihood of occurrence of the values between the minimum and maximum is the same. In other words, all values have an equal chance of occurring, and the distribution is simply characterized by the minimum and maximum values. Examples of variables that can be described by a uniform distribution include manufacturing costs or future sales revenues for a new product. An illustration of the uniform distribution is given below:

          • Exponential Distribution: this is a continuous distribution used to model the time that pass between independent occurrences, provided that the rate of occurrences is known. An example of the exponential distribution is given below:

          • Discrete Distribution : for this kind of distribution, the "user" defines specific values that may occur and the likelihood of each of them. An example might be the results of a lawsuit: 20% chance of positive verdict, 30% change of negative verdict, 40% chance of settlement, and 10% chance of mistrial.

          5. A real use case: estimating the financial risk of a portfolio of stocks

          We hope that by now you have a good understanding about Monte Carlo simulation. Next, we apply this method to a real use case: financial risk estimation.

          Imagine that you are an investor on the stock market. You plan to buy some stocks and you want to estimate the maximum loss you could incur after two weeks of investing. This is the quantity that the financial statistic "Value at Risk" (VaR) seeks to measure. VaR is defined as a measure of investment risk that can be used as a reasonable estimate of the maximum probable loss for a value of an investment portfolio, over a particular time period. A VaR statistic depends on three parameters: a portfolio, a time period, and a confidence level. A VaR of 1 million dollars with a 95% confidence level over two weeks, indicates the belief that the portfolio stands only a 5% chance of losing more than 1 million dollars over two weeks. VaR has seen widespread use across financial services organizations. This statistic plays a vital role in determining how much cash investors must hold to meet the credit ratings that they seek. In addition, it is also used to understand the risk characteristics of large portfolios: it is a good idea to compute the VaR before executing trades, such that it can help take informed decisions about investments.

          Our goal is calculating VaR of two weeks interval with 95% confidence level and the associated VaR confidence interval.

          5.1. Terminology

          In this use case, we will use some terms that might require a proper definition, given the domain. This is what we call the Domain Knowledge.

          • Instrument: A tradable asset, such as a bond, loan, option, or stock investment. At any particular time, an instrument is considered to have a value, which is the price for which it can be sold. In the use case of this notebook, instruments are stock investments.
          • Portfolio: A collection of instruments owned by a financial institution.
          • Return: The change in an instrument or portfolio’s value over a time period.
          • Loss: A negative return.
          • Index: An imaginary portfolio of instruments. For example, the NASDAQ Composite index includes about 3,000 stocks and similar instruments for major US and international companies.
          • Market factor: A value that can be used as an indicator of macro aspects of the financial climate at a particular time. For example, the value of an index, the Gross Domestic Product of the United States, or the exchange rate between the dollar and the euro. We will often refer to market factors as just factors.

          5.2. The context of our use case

          We have a list of instruments that we plan to invest in. The historical data of each instrument has been collected for you. For simplicity, assume that the returns of instruments at a given time, depend on 4 market factors only:

          • GSPC value
          • IXIC value
          • The return of crude oil
          • The return of treasury bonds

          Our goal is building a model to predict the loss after two weeks' time interval with confidence level set to 95%.

          As a side note, it is important to realize that the approach presented in this Notebook is a simplified version of what would happen in a real Financial firm. For example, the returns of instruments at a given time often depend on more than 4 market factors only! Moreover, the choice of what constitute an appropriate market factor is an art!

          5.3. The Data

          The stock data can be downloaded (or scraped) from Yahoo! by making a series of REST calls. The data includes multiple files. Each file contains the historical information of each instrument that we want to invest in. The data is in the following format (with some samples):

          Date, Open, High, Low, Close, Volume, Adj Close
          2016-01-22,66.239998,68.07,65.449997,67.860001,137400,67.860001
          2016-01-21,65.410004,66.18,64.459999,65.050003,148000,65.050003
          2016-01-20,64.279999,66.32,62.77,65.389999,141300,65.389999
          2016-01-19,67.720001,67.989998,64.720001,65.379997,178400,65.379997

          The data of GSPC and IXIC values (our two first market factors) are also available on Yahoo! and use the very same format.

          The crude oil and treasure bonds data is collected from investing.com, and has a different format, as shown below (with some samples):

          Date    Price   Open    High    Low     Vol.    Change %
          Jan 25, 2016    32.17   32.36   32.44   32.10   -       -0.59%
          Jan 24, 2016    32.37   32.10   32.62   31.99   -       0.54%
          Jan 22, 2016    32.19   29.84   32.35   29.53   -       9.01%
          Jan 21, 2016    29.53   28.35   30.25   27.87   694.04K 11.22%
          Jan 20, 2016    26.55   28.33   28.58   26.19   32.11K  -6.71%
          Jan 19, 2016    28.46   29.20   30.21   28.21   188.03K -5.21%

          In our use case, the factors' data will be used jointly to build a statistical model: as a consequence, we first need to preprocess the data to proceed.

          5.4. Data preprocessing

          In this Notebook, all data files have been downloaded for you, such that you can focus on pre-processing. Next, we will:

          • Read the factor data files which are in two different formats, process and merge them together
          • Read the stock data and pre-process it
          • Trim all data into a specific time region
          • Fill in the missing values
          • Generate the data of returns in each two weeks' time interval window

          Factor data pre-processing

          We need two functions to read and parse data from Yahoo! and Investing.com respectively. We are interested only in information about the time and the corresponding returns of a factor or an instrument: as a consequence, we will project away many columns of our RAW data, and keep only the information we are interested in.

          The 3000-instrument and the 4-factor history are small enough to be read and processed locally: we do not need to use the power of parallel computing to proceed. Note that this is true also for larger cases with hundreds of thousands of instruments and thousands of factors. The need for a distributed system like Spark comes in when actually running the Monte Carlo simulations, which can require massive amounts of computation on each instrument.

          Question 4

          Question 4.1

          Write a function named `readInvestingDotComHistory` to parse data from investing.com based on the format specified above (see Section 5.3). Recall that we use two factors here: one that is related to the price of crude oil, one that is related to some specific US bonds.
            Print the first 5 entries of the first factor (crude oil price) in the parsed data.
              Note that we are only interested in the date and price of stocks.

              HINT

              You can parse a string to datetime object by using the function strptime(<string>, <dtime_format>). In this case, the datetime format is "%b %d, %Y". For more information, please follow this link.

              In the next cell, we simply copy data from our HDFS cluster (that contains everything we need for this Notebook) to the instance (a Docker container) running your Notebook. This means that you will have "local" data that you can process without using Spark. Note the folder location: find and verify that you have correctly downloaded the files!

              In [ ]:
              ! [ -d monte-carlo-risk ] || (echo "Downloading prepared data from HDFS. Please wait..." ; hdfs dfs -copyToLocal /datasets/monte-carlo-risk . ; echo "Done!";)
              
              In [72]:
              # read data from local disk
              def readInvestingDotComHistory(fname):
                  def process_line(line):
                      cols = line.split('\t')
                      date = datetime.strptime(cols[0], "%b %d, %Y")
                      value = float(cols[1])
                      return (date, value)
                  
                  with open(fname) as f:
                      content_w_header = f.readlines()
                      # remove the first line 
                      # and reverse lines to sort the data by date, in ascending order
                      content = content_w_header[:0:-1]
                      return list(map(process_line , content))
              
              factor1_files = ['crudeoil.tsv', 'us30yeartreasurybonds.tsv']
              factor1_files = map(lambda fn: factors_folder + fn, factor1_files)
              factors1 = [readInvestingDotComHistory(f) for f in factor1_files]
              print("The first five elements of the first factor (crude oil) are:")
              display(pd.DataFrame(factors1[0][:5], columns=['Date', 'Value']))
              
              The first five elements of the first factor (crude oil) are:
              
              Date Value
              0 2006-01-26 66.26
              1 2006-01-27 67.76
              2 2006-01-30 68.35
              3 2006-01-31 67.92
              4 2006-02-01 66.56

              Now, the data structure factors1 is a list, containing data that pertains to two (out of a total of four) factors that influence the market, as obtained by investing.com. Each element in the list is a tuple, containing some sort of timestamp, and the value of one of the two factors discussed above. From now on, we call these elements "records" or "entries". Visually, factors1 looks like this:

              0 (crude oil) 1 (US bonds)
              time_stamp, value time_stamp, value
              ... ...
              time_stamp, value time_stamp, value
              ... ...

              Question 4.2

              Write a function named `readYahooHistory` to parse data from yahoo.com based on its format, as described in Section 5.3.
                Print the first 5 entries of the first factor (namely GSPC). Comment the time range of the second batch of data we use in our Notebook.
                  Note that we are only interested in the date and price of stocks.

                  NOTE
                  The datetime format now is in a different format than the previous one.

                  HINT
                  Use a terminal (or put the bash commands inline in your Notebook) to list filenames in your local working directory to find and have a look at your local files.

                  In [73]:
                  # read data from local disk
                  def readYahooHistory(fname):
                      def process_line(line):
                          cols = line.split(',')#
                          date = datetime.strptime(cols[0], "%Y-%m-%d")#
                          value = float(cols[1])#
                          return (date, value)
                      
                      with open(fname) as f:
                          content_w_header = f.readlines()
                          # remove the first line 
                          # and reverse lines to sort the data by date, in ascending order
                          content = content_w_header[:0:-1]
                          return list(map(process_line , content))
                      
                  
                  factor2_files = ['GSPC.csv', 'IXIC.csv']
                  factor2_files = map(lambda fn: factors_folder + fn, factor2_files)
                  
                  factors2 = [readYahooHistory(f) for f in factor2_files]
                  
                  print("GSPC:")
                  display(pd.DataFrame(factors2[0][:5], columns=['Date', 'Value']))
                  print("IXIC:")
                  display(pd.DataFrame(factors2[1][:5], columns=['Date', 'Value']))
                  
                  GSPC:
                  
                  Date Value
                  0 1950-01-03 16.66
                  1 1950-01-04 16.85
                  2 1950-01-05 16.93
                  3 1950-01-06 16.98
                  4 1950-01-09 17.08
                  IXIC:
                  
                  Date Value
                  0 1971-02-05 100.000000
                  1 1971-02-08 100.839996
                  2 1971-02-09 100.760002
                  3 1971-02-10 100.690002
                  4 1971-02-11 101.449997
                  COMMENT
                  We can see that there are missing days (8th and 9th January) that obviously correspond to the weekend. We can check that using the function strftime().
                  In [74]:
                  print("January 7th, 1950 is", datetime(1950, 1, 7).strftime("%A"))
                  print("January 8th, 1950 is", datetime(1950, 1, 8).strftime("%A"))
                  
                  January 7th, 1950 is Saturday
                  January 8th, 1950 is Sunday
                  
                  In addition, the time range is not the same:
                  In [75]:
                  print("Time range of GSPC: [", factors2[0][0][0].strftime("%Y"), ",", factors2[0][-1][0].strftime("%Y"), "]")
                  print("Time range of IXIC: [", factors2[1][0][0].strftime("%Y"), ",", factors2[1][-1][0].strftime("%Y"), "]")
                  
                  Time range of GSPC: [ 1950 , 2016 ]
                  Time range of IXIC: [ 1971 , 2016 ]
                  

                  Now, the data structure factors2 is again list, containing data that pertains to the next two (out of a total of four) factors that influence the market, as obtained by Yahoo!. Each element in the list is a tuple, containing some sort of timestamp, and the value of one of the two factors discussed above. Visually, factors2 looks like this:

                  0 (GSPC) 1 (IXIC)
                  time_stamp, value time_stamp, value
                  ... ...
                  time_stamp, value time_stamp, value
                  ... ...

                  Stock data pre-processing

                  Next, we prepare the data for the instruments we consider in this Notebook (i.e., the stocks we want to invest in).

                  Question 4.3

                  In this Notebook, we assume that we want to invest on the first 35 stocks out of the total 3000 stocks present in our datasets.
                    Load and prepare all the data for the considered instruments (the first 35 stocks) which have historical information for more than 5 years. This means that all instruments with less than 5 years of history should be removed.

                    HINT
                    we suggest to open a terminal window (not on your local machine, but the Notebook terminal that you can find on the Jupyter dashboard) and visually check the contents of the directories holding our dataset, if you didn't do this before! Have a look at how stock data is organized!

                    In [76]:
                    ##### the data is similar to the one in Yahoo...
                    def process_stock_file(fname):
                        try:
                            return readYahooHistory(fname)
                        except Exception as e:
                            raise e
                            return None
                    
                    # select path of all stock data files in "stock_folder"
                    files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
                    
                    # assume that we invest only the first 35 stocks (for faster computation)
                    files = files[:35]
                    
                    # read each line in each file, convert it into the format: (date, value)
                    rawStocks = [process_stock_file(f) for f in files]
                    
                    print("Initial number of stocks:", len(rawStocks))
                    
                    # select only instruments which have more than 5 years of history
                    # Note: the number of business days in a year is 260
                    number_of_years = 5
                    rawStocks = list(filter(lambda instrument: len(instrument) >= 260 * number_of_years  , rawStocks))
                    
                    print("Number of stocks after filtering:", len(rawStocks))
                    
                    # For testing, print the first 5 entry of the first stock
                    print("The first 5 entries of the first stock:")
                    display(pd.DataFrame(rawStocks[0][:5], columns=['Date', 'Value']))
                    
                    Initial number of stocks: 35
                    Number of stocks after filtering: 29
                    The first 5 entries of the first stock:
                    
                    Date Value
                    0 1997-08-14 39.0
                    1 1997-08-15 42.0
                    2 1997-08-18 44.0
                    3 1997-08-19 55.5
                    4 1997-08-20 48.0
                    COMMENT
                    After filtering our data, 6 stocks have been filtered (out of 35 stocks). There are 29 remaining stocks.

                    Time alignment for our data

                    Different types of instruments may trade on different days, or the data may have missing values for other reasons, so it is important to make sure that our different histories align. First, we need to trim all of our time series to the same region in time. Then, we need to fill in missing values. To deal with time series that have missing values at the start and end dates in the time region, we simply fill in those dates with nearby values in the time region.

                    Question 4.4

                    Assume that we only focus on the data from 23/01/2009 to 23/01/2014. Write a function named `trimToRegion` to select only the records in that time interval.
                      **Requirements**: after processing, each instrument $i$ has a list of records: $[r_0, r_2,...,r_{m_i}]$ such that $r_0$ and $r_{m_i}$ are assigned, respectively, the first and the last values corresponding to the extremes of the given time interval. For example: $r_0$ should contain the value at date 23/01/2009.
                      In [77]:
                      # note that the data of crude oil and treasury is only available starting from 26/01/2006 
                      start = datetime(year=2009, month=1, day=23)
                      end = datetime(year=2014, month=1, day=23)
                      
                      def isInTimeRegion(entry):
                              (date, value) = entry
                              return date >= start and date <= end
                      
                      def trimToRegion(history, start, end):
                          
                          # only select entries which are in the time region
                          trimmed = list(filter(isInTimeRegion, history))
                          
                          # if the data has incorrect time boundaries, add time boundaries
                          if trimmed[0][0] != start:
                              trimmed.insert(0, (start, trimmed[0][1]))
                          if trimmed[-1][0] != end:
                              trimmed.append((end, trimmed[-1][1]))
                          return trimmed
                          
                      # test our function
                      trimmedStock0  = trimToRegion(rawStocks[0], start, end)
                      # the first 5 records of stock 0
                      print('The first 5 records of stock 0:')
                      display(pd.DataFrame(trimmedStock0[:5], columns=['Date', 'Value']))
                      # the last 5 records of stock 0
                      print('The last 5 records of stock 0:')
                      display(pd.DataFrame(trimmedStock0[-5:], columns=['Date', 'Value']))
                      
                      assert(trimmedStock0[0][0] == start), "the first record must contain the price in the first day of time interval"
                      assert(trimmedStock0[-1][0] == end), "the last record must contain the price in the last day of time interval"
                      
                      The first 5 records of stock 0:
                      
                      Date Value
                      0 2009-01-23 19.400000
                      1 2009-01-26 19.670000
                      2 2009-01-27 19.809999
                      3 2009-01-28 20.469999
                      4 2009-01-29 21.410000
                      The last 5 records of stock 0:
                      
                      Date Value
                      0 2014-01-16 37.369999
                      1 2014-01-17 37.470001
                      2 2014-01-21 37.730000
                      3 2014-01-22 37.779999
                      4 2014-01-23 37.590000

                      Dealing with missing values

                      We expect that we have the price of instruments and factors in each business day. Unfortunately, there are many missing values in our data: this means that we miss data for some days, e.g. we have data for the Monday of a certain week, but not for the subsequent Tuesday. So, we need a function that helps filling these missing values.

                      Next, we provide to you the function to fill missing value: read it carefully!

                      In [78]:
                      def fillInHistory(history, start, end):
                          curr = history
                          filled = []
                          idx = 0
                          curDate = start
                          numEntries = len(history)
                          while curDate < end:
                              
                              # if the next entry is in the same day
                              # or the next entry is at the weekend
                              # but the curDate has already skipped it and moved to the next monday
                              # (only in that case, curr[idx + 1][0] < curDate )
                              # then move to the next entry
                              while idx + 1 < numEntries and curr[idx + 1][0] <= curDate:
                                  idx +=1
                      
                              # only add the last value of instrument in a single day
                              # check curDate is weekday or not
                              # 0: Monday -> 5: Saturday, 6: Sunday
                              if curDate.weekday() < 5:
                                  
                                  filled.append((curDate, curr[idx][1]))
                                  # move to the next business day
                                  curDate += timedelta(days=1)
                              
                              # skip the weekends
                              elif curDate.weekday() >= 5:
                                  # if curDate is Sat, skip 2 days, otherwise, skip 1 day
                                  curDate += timedelta(days=(7-curDate.weekday()))
                      
                          return filled
                      

                      Question 4.5

                      Trim data of stocks and factors into the given time interval.
                      In [80]:
                      # trim into a specific time region
                      # and fill up the missing values
                      stocks = list(map(lambda stock:
                                  fillInHistory(
                                      trimToRegion(stock, start, end), 
                                  start, end), 
                              rawStocks))
                      
                      # merge two factors, trim each factor into a time region
                      # and fill up the missing values
                      allfactors = factors1 + factors2
                      factors = list(map(lambda stock:
                                  fillInHistory(
                                      trimToRegion(stock, start, end), 
                                  start, end), 
                                  allfactors
                                  ))
                                  
                      # test our code
                      print("The first 5 records of stock 0:")
                      display(pd.DataFrame(stocks[0][:5], columns=['Date', 'Value']))
                      
                      print("The last 5 records of stock 0:")
                      display(pd.DataFrame(stocks[0][-5:], columns=['Date', 'Value']))
                      
                      print("The first 5 records of factor 0:")
                      display(pd.DataFrame(factors[0][:5], columns=['Date', 'Value']))
                      
                      print("The last 5 records of factor 0:")
                      display(pd.DataFrame(factors[0][-5:], columns=['Date', 'Value']))
                      
                      The first 5 records of stock 0:
                      
                      Date Value
                      0 2009-01-23 19.400000
                      1 2009-01-26 19.670000
                      2 2009-01-27 19.809999
                      3 2009-01-28 20.469999
                      4 2009-01-29 21.410000
                      The last 5 records of stock 0:
                      
                      Date Value
                      0 2014-01-16 37.369999
                      1 2014-01-17 37.470001
                      2 2014-01-20 37.470001
                      3 2014-01-21 37.730000
                      4 2014-01-22 37.779999
                      The first 5 records of factor 0:
                      
                      Date Value
                      0 2009-01-23 46.47
                      1 2009-01-26 45.73
                      2 2009-01-27 41.58
                      3 2009-01-28 42.16
                      4 2009-01-29 41.44
                      The last 5 records of factor 0:
                      
                      Date Value
                      0 2014-01-16 93.96
                      1 2014-01-17 94.37
                      2 2014-01-20 93.93
                      3 2014-01-21 94.99
                      4 2014-01-22 96.73

                      Recall that Value at Risk (VaR) deals with losses over a particular time horizon. We are not concerned with the absolute prices of instruments, but how those prices change over a given period of time. In our project, we will set that length to two weeks: we use the sliding window method to transform time series of prices into an overlapping sequence of price change over two-week intervals.

                      The figure below illustrates this process. The returns of market factors after each two-week interval is calculated in the very same way.

                      In [81]:
                      def buildWindow(seq, k=2):
                          "Returns a sliding window (of width k) over data from iterable data structures"
                          "   s -> (s0,s1,...s[k-1]), (s1,s2,...,sk), ...                   "
                          it = iter(seq)
                          result = tuple(islice(it, k))
                          if len(result) == k:
                              yield result  
                          for elem in it:
                              result = result[1:] + (elem,)
                              yield result
                      

                      Question 4.6

                      Compute the returns of the stocks after each two-week time window.
                      In [82]:
                      def calculateReturn(window):
                          # return the change of value after two weeks
                          return window[-1][1] - window[0][1]
                      
                      def twoWeekReturns(history):
                          # we use 11 instead of 14 to define the window
                          # because financial data does not include weekends
                          # so there are 10 days
                          # but we need to include the 11th day to calculate the change in values after each two-weeks as indicated above
                          return [calculateReturn(entry) for entry in buildWindow(history, 11)] 
                      
                      stocksReturns = list(map(twoWeekReturns, stocks))
                      factorsReturns = list(map(twoWeekReturns, factors))
                      
                      # test our functions
                      print("the first 5 returns of stock 0:")
                      display(pd.DataFrame(stocksReturns[0][:5], columns=['Return']))
                      
                      print("the last 5 returns of stock 0:")
                      display(pd.DataFrame(stocksReturns[0][-5:], columns=['Return']))
                      
                      the first 5 returns of stock 0:
                      
                      Return
                      0 1.270000
                      1 0.860001
                      2 0.380002
                      3 -0.639999
                      4 -1.730000
                      the last 5 returns of stock 0:
                      
                      Return
                      0 -1.630001
                      1 -1.059998
                      2 -1.459999
                      3 -0.580001
                      4 -0.410000
                      COMMENT
                      There was an error in building the window, we should use 11 days instead of 10 to have a correct two-week time window. Let's make sure of that:
                      In [83]:
                      windowOfStock0 = list(buildWindow(stocks[0], 11))[0]
                      display(pd.DataFrame({'Date': [x[0].strftime("%A, %B %d, %Y") for x in windowOfStock0] ,'Value': [x[1] for x in windowOfStock0]}, columns=['Date','Value']))
                      
                      Date Value
                      0 Friday, January 23, 2009 19.400000
                      1 Monday, January 26, 2009 19.670000
                      2 Tuesday, January 27, 2009 19.809999
                      3 Wednesday, January 28, 2009 20.469999
                      4 Thursday, January 29, 2009 21.410000
                      5 Friday, January 30, 2009 20.100000
                      6 Monday, February 02, 2009 20.360001
                      7 Tuesday, February 03, 2009 21.830000
                      8 Wednesday, February 04, 2009 21.270000
                      9 Thursday, February 05, 2009 20.200001
                      10 Friday, February 06, 2009 20.670000
                      As you can see, when we use 11 days in the parameters, we can easily work out the return from Friday, January 23, 2009 to Friday, February 06, 2009 for instance as indicated in the explanation. Using only 10 days would be wrong.

                      Alright! Now we have data that is properly aligned to start the training process: stocks' returns and factors' returns, per time windows of two weeks. Next, we will apply the MCS method.

                      5.5. Summary guidelines to apply the MCS method on the data we prepared

                      Next, we overview the steps that you have to follow to build a model of your data, and then use Monte Carlo simulations to produce output distributions:

                      • Step 1: Defining the relationship between the market factors and the instrument's returns. This relationship takes the form of a model fitted to historical data.
                      • Step 2: Defining the distributions for the market conditions (particularly, the returns of factors) that are straightforward to sample from. These distributions are fitted to historical data.
                      • Step 3: Generate the data for each trial of a Monte Carlo run: this amount to generating the random values for market conditions along with these distributions.
                      • Step 4: For each trial, from the above values of market conditions, and using the relationship built in step 1, we calculate the return for each instrument and the total return. We use the returns to define an empirical distribution over losses. This means that, if we run 100 trials and want to estimate the 5% VaR, we would choose it as the loss from the trial with the fifth greatest loss.
                      • Step 5: Evaluating the result

                      5.6. Applying MCS

                      Step 1: Defining relationship between market factors and instrument's returns

                      In our simulation, we will use a simple linear model. By our definition of return, a factor return is a change in the value of a market factor over a particular time period, e.g. if the value of the S&P 500 moves from 2000 to 2100 over a time interval, its return would be 100.

                      A vector that contains the return of 4 market factors is called a market factor vector. Generally, instead of using this vector as features, we derive a set of features from simple transformation of it. In particular, a vector of 4 values is transformed into a vector of length $m$ by function $F$. In the simplest case $F(v) = v$.

                      Denote $v_t$ the market factor vector, and $f_t$ the transformed features of $v_t$ at time $t$.

                      $f_{tj}$ is the value of feature $j$ in $f_t$.

                      Denote $r_{it}$ the return of instrument $i$ at time $t$ and $c_i$ the intercept term of instrument $i$.

                      We will use a simple linear function to calculate $r_{it}$ from $f_t$:

                      $$ r_{it} = c_i + \sum_{j=1}^{m}{w_{ij}*f_{tj}} $$

                      where $w_{ij}$ is the weight of feature $j$ for instrument $i$.

                      All that above means that given a market factor vector, we have to apply featurization and then use the result as a surrogate for calculating the return of the instruments, using the above linear function.

                      There are two questions that we should consider: how we apply featurization to a factor vector? and how to pick values for $w_{ij}$?

                      How we apply featurization to a factor vector? In fact, the instruments' returns may be non-linear functions of the factor returns. So, we should not use factor returns as features in the above linear function. Instead, we transform them into a set of features with different size. In this Notebook, we can include some additional features in our model that we derive from non-linear transformations of the factor returns. We will try adding two more features for each factor return: its square and its square root values. So, we can still assume that our model is a linear model in the sense that the response variable is a linear function of the new features. Note that the particular feature transformation described here is meant to be an illustrative example of some of the options that are available: it shouldn't be considered as the state of the art in predictive financial modeling!!.

                      How to pick values for $w_{ij}$?

                      For all the market factor vectors in our historical data, we transform them to feature vectors. Now, we have feature vectors in many two-week intervals and the corresponding instrument's returns in these intervals. We can use Ordinary Least Square (OLS) regression model to estimate the weights for each instrument such that our linear function can fit to the data. The parameters for OLS function are:

                      • x: The collection of columns where each column is the value of a feature in many two-week interval
                      • y: The return of an instrument in the corresponding time interval of x.

                      The figure below shows the basic idea of the process to build a statistical model for predicting the returns of stock X.

                      Question 5

                      Question 5.1

                      Currently, our data is in form of: $$ factorsReturns= \begin{bmatrix} r_{00} & r_{01} & r_{02} & ... & r_{0k} \\ r_{10} & r_{11} & r_{12} & ... & r_{1k} \\ ... & ... & ... & ... & ... \\ r_{n0} & r_{n1} & r_{n2} & ... & r_{nk}\\ \end{bmatrix} $$
                        $$ stocksReturns= \begin{bmatrix} s_{00} & s_{01} & s_{02} & ... & s_{0k} \\ s_{10} & s_{11} & s_{12} & ... & s_{1k} \\ ... & ... & ... & ... & ... \\ s_{n0} & s_{n1} & s_{n2} & ... & s_{nk}\\ \end{bmatrix} $$
                          Where, $r_{ij}$ is the return of factor $i^{th}$ in time window $j^{th}$, $k$ is the number of time windows, and $n$ is the number of factors. A similar definition goes for $s_{ij}$.
                            In order to use OLS, the parameter must be in form of:
                              $$ x=factorsReturns^T = \begin{bmatrix} r_{00} & r_{10} & ... & r_{n0} \\ r_{01} & r_{11} & ... & r_{n1} \\ r_{02} & r_{12} & ... & r_{n2}\\ ... & ... & ... & ... \\ r_{0k} & r_{1k} & ... & r_{nk}\\ \end{bmatrix} $$
                                Whereas, $y$ can be any row in `stocksReturns`.
                                  So, we need a function to transpose a matrix. Write a function named `transpose` to do just that.
                                  In [84]:
                                  def transpose(matrix):
                                      return np.matrix.tolist(np.array(matrix).T)
                                  
                                  # test function
                                  assert (transpose([[1,2,3], [4,5,6], [7,8,9]]) == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]), "Function transpose runs incorrectly"
                                  

                                  Question 5.2

                                  Write a function named `featurize` that takes a list factor's returns $[x_1, x_2,...,x_k]$ and transform it into a new list of features $[u_1,u_2,..,u_k, v_1, v_2,..,v_k, x_1,x_2,...,x_k]$.
                                    Where, $u_i$ = $\left\{ \begin{array}{ll} x_i^2 & \mbox{if } x_i \geq 0 \\ -x_i^2 & \mbox{if } x_i < 0 \end{array} \right. $
                                      and $v_i$ = $\left\{ \begin{array}{ll} \sqrt{x_i} & \mbox{if } x_i \geq 0 \\ -\sqrt{-x_i} & \mbox{if } x_i < 0 \end{array} \right. $
                                      In [85]:
                                      def featurize(factorReturns):
                                          squaredReturns = np.sign(factorReturns) * np.power(factorReturns, 2)
                                          squareRootedReturns = np.sign(factorReturns) * np.sqrt(np.abs(factorReturns))
                                          # concat new features
                                          return np.matrix.tolist(np.concatenate((squaredReturns, squareRootedReturns, factorReturns)).flatten())
                                      
                                      # test our function
                                      assert (featurize([4, -9, 25]) == [16, -81, 625, 2, -3, 5, 4, -9, 25]), "Function runs incorrectly"
                                      

                                      Question 5.3

                                      Using OLS, estimate the weights for each feature on each stock. What is the shape of `weights` (size of each dimension)? Explain it.
                                      In [86]:
                                      def estimateParams(y, x):
                                          return sm.OLS(y, x).fit().params
                                      
                                      # transpose factorsReturns
                                      factorMat = transpose(factorsReturns)
                                      
                                      # featurize each row of factorMat
                                      factorFeatures = list(map(featurize, factorMat))
                                      
                                      # OLS require parameter is a numpy array
                                      factor_columns = np.array(factorFeatures)
                                      
                                      #add a constant - the intercept term for each instrument i.
                                      factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      
                                      # estimate weights
                                      weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
                                      
                                      print("Weights:")
                                      display(pd.DataFrame(weights, columns=['Factor {0}'.format(i+1) for i in range(np.array(weights).shape[1])]))
                                      print("The shape of weights:", np.array(weights).shape)
                                      
                                      Weights:
                                      
                                      Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 Factor 8 Factor 9 Factor 10 Factor 11 Factor 12 Factor 13
                                      0 0.009524 -0.000612 -4.021273 0.000043 -2.023628e-05 0.053435 -0.530542 -0.026455 -0.050481 -0.018521 2.350599 0.010435 0.010437
                                      1 -0.113955 0.001569 -0.549022 0.000183 -4.297724e-05 0.170226 0.722078 0.104859 -0.060719 -0.085995 -0.257125 -0.020413 0.027102
                                      2 -0.016672 0.003223 -1.555895 -0.000005 -1.886192e-06 0.090952 -0.148069 -0.015051 0.005516 -0.077671 0.785264 0.004220 0.000412
                                      3 0.045618 0.000734 2.801668 -0.000011 -7.583881e-06 0.041191 1.308099 0.006362 -0.031959 -0.024519 -3.916901 0.001442 0.007631
                                      4 -0.180444 0.003390 13.346642 0.000418 -8.484913e-05 -0.125415 1.766486 0.344207 -0.142279 0.002660 -4.872026 -0.075670 0.032188
                                      5 0.020019 -0.000953 1.539840 0.000019 2.879595e-07 -0.061857 -0.045476 0.027236 0.023662 0.029484 -0.567469 -0.004365 -0.002233
                                      6 0.057446 0.001833 -0.169657 -0.000060 3.511639e-06 0.182956 -0.487748 0.010258 -0.008212 -0.087560 0.087162 0.007789 0.003707
                                      7 -0.011369 0.005037 1.761397 0.000023 -2.668707e-05 0.078045 -0.561006 0.042452 -0.023015 -0.075382 0.725519 -0.005257 0.007593
                                      8 -0.055531 0.003458 3.228843 0.000058 -1.391557e-05 0.060375 0.247638 -0.008279 -0.004961 -0.050568 -1.474523 -0.000960 0.003947
                                      9 -0.043639 0.007076 17.052865 0.000104 1.545063e-05 0.401248 2.910033 -0.049801 0.093655 -0.220513 -12.669944 0.014543 -0.000120
                                      10 -0.008234 -0.008616 4.287728 -0.000061 7.373857e-06 -0.299519 0.963026 -0.009180 -0.010226 0.195023 -2.346364 0.011937 0.008460
                                      11 -0.212248 0.002574 7.806819 0.000069 -6.063152e-05 0.058189 0.008443 0.060999 -0.136634 -0.039764 -1.283979 -0.016608 0.028304
                                      12 -0.010983 0.000884 -0.703307 0.000027 -8.959446e-06 0.070556 0.079193 0.008436 -0.029656 -0.021236 0.196450 -0.001527 0.003890
                                      13 -0.144728 -0.000123 8.465456 0.000114 -3.446265e-05 -0.021887 0.519436 -0.002736 -0.007310 0.006011 -4.220148 0.006583 0.011822
                                      14 -0.039591 0.000352 0.825383 0.000032 -9.680923e-06 0.010249 -0.050635 0.007557 -0.004438 -0.012534 -0.036963 0.000872 0.002069
                                      15 -0.006760 -0.002334 -0.838133 0.000013 1.174405e-05 0.000626 -0.247526 -0.005483 0.029718 0.029333 0.645230 -0.004509 -0.001307
                                      16 -0.042856 -0.000899 1.511250 -0.000067 1.682368e-05 0.014457 0.009678 -0.018555 0.026857 -0.000921 -0.249491 0.010052 -0.004366
                                      17 -0.019518 -0.001572 0.629886 -0.000008 6.657335e-06 -0.044810 0.132431 -0.000502 0.022644 0.036904 -0.346760 0.000605 -0.000851
                                      18 0.012968 -0.001583 -0.905781 0.000031 -1.852933e-07 -0.075573 -0.073777 0.011743 0.001992 0.056625 0.045238 -0.004254 0.001448
                                      19 -0.036065 -0.000631 -0.215562 0.000043 -1.868831e-05 -0.002785 -0.328279 -0.022088 -0.005904 0.001976 0.768727 0.008818 0.006699
                                      20 -0.056616 0.001277 -0.274187 0.000031 -8.978748e-06 0.019159 -0.276666 0.019812 -0.029063 -0.014281 0.446368 -0.003037 0.005602
                                      21 -0.013763 -0.000687 -0.008672 0.000015 1.919703e-06 0.003958 -0.094426 0.023860 -0.002452 0.000040 0.190231 -0.003860 -0.000080
                                      22 -0.058551 0.005041 1.608269 0.000063 -2.616348e-05 0.140465 -0.828678 0.017698 -0.029072 -0.089664 2.057569 -0.006499 0.008567
                                      23 -0.013810 -0.000439 0.315803 0.000054 -8.975821e-06 -0.033329 0.097911 0.038011 -0.016267 -0.005234 -0.426449 -0.006716 0.004357
                                      24 0.012027 -0.001744 4.173791 0.000068 -6.647718e-07 -0.038972 0.823333 0.055367 0.000230 0.045748 -3.059792 -0.012304 0.000125
                                      25 0.043728 0.000611 0.539708 0.000008 -7.768398e-06 -0.057947 -0.260758 -0.002811 0.002719 0.019134 0.207803 -0.000584 0.003460
                                      26 -0.001167 0.000938 11.140432 -0.000307 4.464751e-06 0.260410 0.745307 -0.213423 0.036122 -0.081079 -4.735438 0.068372 0.000804
                                      27 0.492560 0.031885 17.840713 -0.000229 8.699260e-05 0.397719 2.650749 0.001232 -0.001016 -0.444769 -8.803790 0.049969 0.003108
                                      28 -0.064352 -0.001703 -2.532117 0.000037 -4.245829e-06 -0.045013 -0.050529 -0.005511 -0.001680 0.041284 0.565330 0.003911 0.000321
                                      The shape of weights: (29, 13)
                                      
                                      COMMENT
                                      The shape of the weight is $ (29, 13) $.
                                      Each stock has a weight corresponding each factor: the 29 lines refer to the number of stocks we have, and 13 columns refers to the number of factors. Actually, we have only 12 factors, the 13th column is linked to the intercept term, whereas the other 12 columns are linked to the 12 factors.

                                      Step 2: Defining the distributions for the market conditions

                                      Since we cannot define the distributions for the market factors directly, we can only approximate their distribution. The best way to do that, is plotting their value. However, these values may fluctuate quite a lot.

                                      Next, we show how to use the Kernel density estimation (KDE) technique to approximate such distributions. In brief, kernel density estimation is a way of smoothing out a histogram: this is achieved by assigning (or centering) a probability distribution (usually a normal distribution) to each data point, and then summing. So, a set of two-week-return samples would result in a large number of "super-imposed" normal distributions, each with a different mean.

                                      To estimate the probability density at a given point, KDE evaluates the PDFs of all the normal distributions at that point and takes their average. The smoothness of a kernel density plot depends on its bandwidth, and the standard deviation of each of the normal distributions. For a brief introduction on KDE, please refer to this link.

                                      In [87]:
                                      def get_domain_and_density(samples):
                                          vmin = min(samples)
                                          vmax = max(samples)
                                          stddev = np.std(samples)
                                          
                                          domain = np.arange(vmin, vmax, (vmax-vmin)/100)
                                          
                                          # a simple heuristic to select bandwidth
                                          bandwidth = 1.06 * stddev * pow(len(samples), -.2)
                                          
                                          # estimate density
                                          kde = KDEUnivariate(samples)
                                          kde.fit(bw=bandwidth)
                                          density = kde.evaluate(domain)
                                          
                                          return domain, density
                                      
                                      In [88]:
                                      def plotDistribution(samples, plotTitle=""):
                                          domain, density = get_domain_and_density(samples)
                                          
                                          # plot
                                          plt.figure(figsize=(15,5))
                                          plt.plot(domain, density)
                                          plt.xlabel("Two Week Return ($)")
                                          plt.ylabel("Density")
                                          plt.grid(linestyle='--', axis="y")
                                          if (plotTitle):
                                              plt.title(plotTitle)
                                          plt.show()
                                      
                                      plotDistribution(factorsReturns[0], "Two-week Crude oil returns distribution")
                                      plotDistribution(factorsReturns[1], "Two-week Treasury bonds returns distribution")
                                      plotDistribution(factorsReturns[2], "Two-week GSPC returns distribution")
                                      plotDistribution(factorsReturns[3], "Two-week IXIC returns distribution")
                                      

                                      For the sake of simplicity, we can say that our smoothed versions of the returns of each factor can be represented quite well by a normal distribution. Of course, more exotic distributions, perhaps with fatter tails, could fit more closely the data, but it is outside the scope of this Notebook to proceed in this way.

                                      Now, the simplest way to sample factors returns is to use a normal distribution for each of the factors, and sample from these distributions independently. However, this approach ignores the fact that market factors are often correlated. For example, when the price of crude oil is down, the price of treasury bonds is down too. We can check our data to verify about the correlation.

                                      Question 6

                                      Question 6.1

                                      Calculate the correlation between market factors and explain the result.

                                      HINT
                                      function np.corrcoef might be useful.

                                      In [89]:
                                      correlation = np.corrcoef(factorsReturns)
                                      correlation
                                      
                                      Out[89]:
                                      array([[1.        , 0.39463374, 0.4160185 , 0.43962453],
                                             [0.39463374, 1.        , 0.48079465, 0.51134416],
                                             [0.4160185 , 0.48079465, 1.        , 0.92233277],
                                             [0.43962453, 0.51134416, 0.92233277, 1.        ]])
                                      COMMENT
                                      Let's plot the correlation between market factors as a heatmap.
                                      In [90]:
                                      mask = np.zeros_like(correlation)
                                      mask[np.triu_indices_from(correlation)] = True
                                      with sns.axes_style("white"):
                                          ax = sns.heatmap(correlation,
                                                           vmin=0, 
                                                           vmax=1, 
                                                           cmap="Blues", 
                                                           xticklabels=['Crude oil', 'Treasury bonds', 'GSPC', ' '], 
                                                           yticklabels=[' ', 'Treasury bonds', 'GSPC', 'IXIC'], 
                                                           mask=mask, 
                                                           cbar_kws={"orientation": "horizontal"})
                                      
                                      The GSPC and IXIC factors (respectively, the 3rd and 4th factors) are highly correlated ($0.92$, dark blue in the heatmap), which means that there is a linear relation between these two factors. Let's check that by plotting them.
                                      In [91]:
                                      plt.figure(figsize=(20,10))
                                      plt.plot(factorsReturns[2], 'r' , label='GSPC factor')
                                      plt.plot(factorsReturns[3], 'b' , label='IXIC factor')
                                      plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
                                      plt.xlabel('Time window')
                                      plt.ylabel('Return')
                                      plt.show()
                                      
                                      Let's plot again these two factors, but this time we're going to multiply the GSPC factor by $\frac{max(IXIC)}{max(GSPC)}$.
                                      In [92]:
                                      plt.figure(figsize=(20,10))
                                      plt.plot(np.array(factorsReturns[2]) * max(factorsReturns[3]) / max(factorsReturns[2]), 'r' , label='GSPC factor')
                                      plt.plot(factorsReturns[3], 'b' , label='IXIC factor')
                                      plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
                                      plt.xlabel('Time window')
                                      plt.ylabel('Return')
                                      plt.show()
                                      
                                      Indeed, the GSPC and IXIC factors look alike. We can almost obtain the IXIC factor by putting the GSPC factor in a linear equation. These two factors have the same variations; if the GSPC is down, the IXIC is down as well.
                                      On the other hand, there is a low correlation ($0.39$, light blue in the heatmap) between the crude oil and the treasury bonds factors (respectively, the 1st and the 2nd factors). As shown in the plot below, there is no relation between these two factors.
                                      In [93]:
                                      plt.figure(figsize=(20,10))
                                      plt.plot(factorsReturns[0], 'c' , label='Crude oil')
                                      plt.plot(factorsReturns[1], 'g' , label='Treasury bonds')
                                      plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
                                      plt.xlabel('Time window')
                                      plt.ylabel('Return')
                                      plt.show()
                                      
                                      Let's plot again these two factors, but this time we're going to multiply the Treasury bonds by $\frac{max(Crude \ oil)}{max(Treasury \ bonds)}$.
                                      In [94]:
                                      plt.figure(figsize=(20,10))
                                      plt.plot(factorsReturns[0], 'c', label='Crude oil')
                                      plt.plot(np.array(factorsReturns[1]) * max(factorsReturns[0]) / max(factorsReturns[1]), 'g', label='Treasury bonds')
                                      plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
                                      plt.xlabel('Time window')
                                      plt.ylabel('Return')
                                      plt.show()
                                      
                                      Indeed, the Crude oil and Treasury bonds factors don't look the same at all.

                                      The multivariate normal distribution can help here by taking the correlation information between the factors into account. Each sample from a multivariate normal distribution can be thought of as a vector. Given values for all of the dimensions but one, the distribution of values along that dimension is normal. But, in their joint distribution, the variables are not independent.

                                      For this use case, we can write:

                                      $$ \left(\begin{array}{c}f_{1}\\f_{2}\\f_{3}\\f_{4} \end{array}\right) \sim N \left[ \left( \begin{array}{c} \mu_1\\ \mu_2 \\ \mu_3 \\ \mu_4 \end{array} \right), \left( \begin{array}{cccc} \sigma^2_1 & \rho_{12} \sigma_1\sigma_2 & \rho_{13} \sigma_1\sigma_3 & \rho_{14} \sigma_1\sigma_4 \\ \rho_{12}\sigma_2\sigma_1 & \sigma^2_2 & \rho_{23} \sigma_2\sigma_3 & \rho_{24} \sigma_2\sigma_4\\ \rho_{13} \sigma_3\sigma_1 & \rho_{23} \sigma_3\sigma_2 & \sigma^2_3 & \rho_{34} \sigma_3\sigma_4 \\ \rho_{14} \sigma_4\sigma_1 & \rho_{24} \sigma_4\sigma_2 & \rho_{34} \sigma_3\sigma_4 & \sigma_4^2 \\ \end{array} \right) \right] $$

                                      Or,

                                      $$ f_t \sim N(\mu, \sum) $$

                                      Where $f_1$, $f_2$, $f_3$ and $f_4$ are the market factors, $\sigma_i$ is the standard deviation of factor $i$, $\mu$ is a vector of the empirical means of the returns of the factors and $\sum$ is the empirical covariance matrix of the returns of the factors.

                                      The multivariate normal is parameterized with a mean along each dimension and a matrix describing the covariance between each pair of dimensions. When the covariance matrix is diagonal, the multivariate normal reduces to sampling along each dimension independently, but placing non-zero values in the off-diagonals helps capture the relationships between variables. Whenever having the mean of this multivariate normal distribution and its covariance matrix, we can generate the sample values for market factors.

                                      Next, we will calculate the mean and the covariance matrix of this multivariate normal distribution from the historical data.

                                      Question 6.2

                                      Calculate the covariance matrix $\sum$ and the means $\mu$ of factors' returns then generate a random vector of factors return that follows a multivariate normal distribution $\sim N(\mu, \sum)$

                                      HINT
                                      Function np.cov can help calculating covariance matrix. Function np.random.multivariate_normal(<mean>, <cov>) is often used for generating samples.

                                      In [95]:
                                      factorCov = np.cov(factorsReturns)
                                      factorMeans = [sum(u) / len(u) for u in factorsReturns]
                                      sample = np.random.multivariate_normal(factorMeans, factorCov)
                                      print("Covariance matrix: ", factorCov)
                                      print("Mean: ", factorMeans)
                                      print("Sample: ", sample)
                                      
                                      Covariance matrix:  [[2.25254567e+01 3.02486401e-01 7.43391337e+01 1.79405865e+02]
                                       [3.02486401e-01 2.60825277e-02 2.92349610e+00 7.10078003e+00]
                                       [7.43391337e+01 2.92349610e+00 1.41754254e+03 2.98588865e+03]
                                       [1.79405865e+02 7.10078003e+00 2.98588865e+03 7.39325758e+03]]
                                      Mean:  [0.3983848531684707, 0.0021514683153013984, 7.754126759659972, 20.82364745285935]
                                      Sample:  [  -4.35704895   -0.32393641  -43.72249535 -124.96987096]
                                      
                                      COMMENT
                                      Let's compare our approximation with the real factors. We're going to slightly modify our previous function plotDistribution. In addition, we're going to compare two probability distributions. Therefore, we're going to use some factors to do that:
                                      • The first one is going to be the mean squared error which is the average of the squares of the errors that are the differences between the real value and what is estimated.
                                      • The second one is the entropy (also called Kullback–Leibler divergence), if it equals 0, we can expect similar behaviors, if not the same, behavior of two different distributions, while a Kullback–Leibler divergence of 1 indicates that the two distributions behave in such a different manner that the expectation given the first distribution approaches zero.
                                      • The third one is going to be Q–Q (quantile-quantile) plot (a probability plot). It's a graphical method for comparing two probability distributions by plotting their quantiles against each other. At the beginning, we choose the set of intervals for the quantiles. A point $(x, y)$ on the plot corresponds to one of the quantiles of the second distribution ($y$-coordinate) plotted against the same quantile of the first distribution ($x$-coordinate). Thus the line is a parametric curve with the parameter which is the number of the interval for the quantile. If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line $y = x$. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line $y = x$.
                                      In [96]:
                                      def compare(samples, factorReturns, plotTitle):
                                          domainSamples, densitySamples = get_domain_and_density(samples) # get the domain and the density of the samples
                                          domainFactorReturns, densityFactorReturns = get_domain_and_density(factorReturns) # get the domain and the density of the factor returns
                                            
                                          print("----", plotTitle, "----")
                                          
                                          MSE = mean_squared_error(densityFactorReturns, densitySamples) # Mean squared error
                                          print("Mean Squared Error:", MSE)
                                          
                                          entropy = stats.entropy(densityFactorReturns, densitySamples) # Relative entropy
                                          print("Relative Entropy:", entropy)
                                          
                                          # plot
                                          fig = plt.figure(figsize=(15,10))
                                          fig.add_subplot(211)
                                          
                                          plt.plot(domainFactorReturns, densityFactorReturns, label="Real factor")
                                          plt.plot(domainSamples, densitySamples, label="Appoximation")
                                          plt.xlabel("Two Week Return ($)")
                                          plt.ylabel("Density")
                                          plt.grid(linestyle='--', axis="y")
                                          plt.title(plotTitle)
                                          plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
                                          
                                          #Q-Q plot
                                          ax = fig.add_subplot(212)
                                          qqplot_2samples(np.array(factorReturns), np.array(samples), line='45', ax=ax)
                                          ax.set_xlabel('Real Factor')
                                          ax.set_ylabel('Approximation')
                                          ax.set_title('A Q–Q plot of the real factor versus its approximation.')
                                          plt.show()    
                                      
                                      In [108]:
                                      samples = np.random.multivariate_normal(factorMeans, factorCov, len(factorsReturns[0])).T
                                      compare(samples[0], factorsReturns[0], "Two-week Crude oil returns distribution")
                                      compare(samples[1], factorsReturns[1], "Two-week Treasury bonds returns distribution")
                                      compare(samples[2], factorsReturns[2], "Two-week GSPC returns distribution")
                                      compare(samples[3], factorsReturns[3], "Two-week IXIC returns distribution")
                                      
                                      ---- Two-week Crude oil returns distribution ----
                                      Mean Squared Error: 0.0002654049031039399
                                      Relative Entropy: 0.14081743491710952
                                      
                                      ---- Two-week Treasury bonds returns distribution ----
                                      Mean Squared Error: 0.15932411018739792
                                      Relative Entropy: 0.11000937432279247
                                      
                                      ---- Two-week GSPC returns distribution ----
                                      Mean Squared Error: 1.7261937558590967e-05
                                      Relative Entropy: 0.5797135794836549
                                      
                                      ---- Two-week IXIC returns distribution ----
                                      Mean Squared Error: 2.1724897926628712e-06
                                      Relative Entropy: 0.3529754183452105
                                      
                                      COMMENT
                                      The factors Crude oil, GSPC and IXIC approximations have a very small error, they differ from the real factors by the fact that they are smaller, shifted or both. The approximation of Treasury bonds has a little bit higher error ($0.36), but it's still low.

                                      Speaking of the relative entropy (Kullback–Leibler divergence), the GSPC factor has a relative entropy higher than 0.5 which characterizes the loss of information in this factor. The other factors have a low value of relative entropy, especially the first two ones (Crude oil and Treasury bonds).

                                      The Q-Q plots above shows that our real factors and their approximations with the normal distribution are a little bit similar at the mean. However, they are different in the tails, that's proven by the distance of the blue points from the red line. The distributions differ more in the left tail, especially with the last two factors (GSPC and IXIC). Therefore, our approximation is kind of bad in tail mismatching.

                                      Given the factors we use to compare the factors with their approximations, we can think that it would be better to choose another distribution for some factors. However, choosing a distribution for each factor is not going to help because we would lose the correlation between factors and thus getting away from the reality. Given the shape of the factors, we need a distribution whose shape is similar to the normal distribution but corrects the heavy left tail generated with it.

                                      Step 3&4: Generating samples, running simulation and calculating the VaR

                                      We define some functions that helps us calculating VaR 5%. You will see that the functions below are pretty complicated! This is why we provide a solution for you: however, study them well!!

                                      The basic idea of calculating VaR 5% is that we need to find a value such that only 5% of the losses are bigger than it. That means the 5th percentile of the losses should be VaR 5%.

                                      VaR can sometimes be problematic though, since it does give any information on the extent of the losses which can exceed the VaR estimate. CVar is an extension of VaR that is introduced to deal with this problem. Indeed, CVaR measures the expected value of the loss in those cases where VaR estimate has been exceeded.

                                      In [109]:
                                      def fivePercentVaR(trials):
                                          numTrials = trials.count()
                                          topLosses = trials.takeOrdered(max(round(numTrials/20.0), 1))
                                          return topLosses[-1]
                                      
                                      # an extension of VaR
                                      def fivePercentCVaR(trials):
                                          numTrials = trials.count()
                                          topLosses = trials.takeOrdered(max(round(numTrials/20.0), 1))
                                          return sum(topLosses)/len(topLosses)
                                      
                                      def bootstrappedConfidenceInterval(
                                            trials, computeStatisticFunction,
                                            numResamples, pValue):
                                          stats = []
                                          for i in range(0, numResamples):
                                              resample = trials.sample(True, 1.0)
                                              stats.append(computeStatisticFunction(resample))
                                          stats.sort()
                                          lowerIndex = int(numResamples * pValue / 2 - 1)
                                          upperIndex = int(np.ceil(numResamples * (1 - pValue / 2)))
                                          return (stats[lowerIndex], stats[upperIndex])
                                      

                                      Next, we will run the Monte Carlo simulation 10,000 times, in parallel using Spark. Since your cluster has 12 cores (two Spark worker nodes, each with 6 cores), we can set parallelism = 12 to dispatch simulation on these cores, across the two machines (remember, those are not really "physical machines", they are Docker containers running in our infrastructure).

                                      Question 7

                                      Complete the code below to define the simulation process and calculate VaR 5%.
                                      In [110]:
                                      # RUN SILMULATION
                                      # We added a parameter 'distribFunc' to easily change our disribution later.
                                      # The parameter 'df' meaning 'degrees of freedom' is useful for some distributions
                                      def simulateTrialReturns(numTrials, factorMeans, factorCov, weights, distribFunc = np.random.multivariate_normal, df=None):
                                          trialReturns = []
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              trialFactorReturns = (
                                                  distribFunc(factorMeans, factorCov) if (df is None) else distribFunc(factorMeans, factorCov, df=df)
                                              )
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize(trialFactorReturns)
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = np.sum(np.dot(weights, trialFeatures))
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      In [121]:
                                      parallelism = 12
                                      numTrials = 10000
                                      trial_indexes = list(range(0, parallelism))
                                      seedRDD = sc.parallelize(trial_indexes, parallelism)
                                      bFactorWeights = sc.broadcast(weights)
                                      
                                      trials = seedRDD.flatMap(lambda idx: \
                                                      simulateTrialReturns(
                                                          max(int(numTrials/parallelism), 1), 
                                                          factorMeans, factorCov,
                                                          bFactorWeights.value
                                                      ))
                                      trials.cache()
                                      
                                      valueAtRisk = fivePercentVaR(trials)
                                      conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                      print("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      
                                      Value at Risk(VaR) 5%: -20.893004969048892
                                      Conditional Value at Risk(CVaR) 5%: -26.324663792435185
                                      
                                      COMMENT
                                      According to our model, given the VaR, the portfolio stands only a $5\%$ chance of losing more than $20.89 \ \$$ over two weeks of investment. In addition, given the CVaR, the average loss in the worst $5\%$ of outcomes is $26.33 \ \$$.
                                      Let's visualize the distribution of returns to see their extemities (are they spikes?) and if they're normally distributed. To do that, we can use the plotDistribution function in the 2nd step (for the individual factors) to plot an estimate of the probability density function for the joint probability distribution using kernel density estimation.
                                      In [122]:
                                      plotDistribution(trials.collect(), "Two-week returns distribution")
                                      
                                      The returns are normally distributed.

                                      The value of VaR depends on how many invested stocks and the chosen distribution of random variables. Assume that we get VaR 5% = -2.66, that means that there is a 0.05 probability that the portfolio will fall in value by more than \$2.66 over a two weeks' period if there is no trading. In other words, the loses are less than \$2.66 over two weeks' period with 95% confidence level. When a loss over two weeks is more than \$2.66, we call it **failure** (or **exception**). Informally, because of 5% probability, we expect that there are only $0.05*W$ failures out of total $W$ windows.

                                      Step 5: Evaluating the results using backtesting method

                                      In general, the error in a Monte Carlo simulation should be proportional to 1/sqrt(n), where n is the number of trials. This means, for example, that quadrupling the number of trials should approximately cut the error in half. A good way to check the quality of a result is backtesting on historical data. Backtesting is a statistical procedure where actual losses are compared to the estimated VaR. For instance, if the confidence level used to calculate VaR is 95% (or VaR 5%), we expect only 5 failures over 100 two-week time windows.

                                      The most common test of a VaR model is counting the number of VaR failures, i.e., in how many windows, the losses exceed VaR estimate. If the number of exceptions is less than selected confidence level would indicate, the VaR model overestimates the risk. On the contrary, if there are too many exceptions, the risk is underestimated. However, it's very hard to observe the amount of failures suggested by the confidence level exactly. Therefore, people try to study whether the number of failures is reasonable or not, or will the model be accepted or rejected.

                                      One common test is Kupiec's proportion-of-failures (POF) test. This test considers how the portfolio performed at many historical time intervals and counts the number of times that the losses exceeded the VaR. The null hypothesis is that the VaR is reasonable, and a sufficiently extreme test statistic means that the VaR estimate does not accurately describe the data. The test statistic is computed as:

                                      $$ -2ln\Bigg(\frac{(1-p)^{T-x}p^x}{(1-\frac{x}{T})^{T-x}(\frac{x}{T})^x}\Bigg) $$

                                      where:

                                      $p$ is the quantile-of-loss of the VaR calculation (e.g., in VaR 5%, p=0.05),

                                      $x$ (the number of failures) is the number of historical intervals over which the losses exceeded the VaR

                                      $T$ is the total number of historical intervals considered

                                      Or we can expand out the log for better numerical stability:

                                      $$ \begin{equation} -2\Big((T-x)ln(1-p)+x*ln(p)-(T-x)ln(1-\frac{x}{T})-x*ln(\frac{x}{T})\Big) \end{equation} $$

                                      If we assume the null hypothesis that the VaR is reasonable, then this test statistic is drawn from a chi-squared distribution with a single degree of freedom. By using Chi-squared distribution, we can find the p-value accompanying our test statistic value. If p-value exceeds the critical value of the Chi-squared distribution, we do have sufficient evidence to reject the null hypothesis that the model is reasonable. Or we can say, in that case, the model is considered as inaccurate.

                                      For example, assume that we calculate VaR 5% (the confidence level of the VaR model is 95%) and get value VaR = 2.26. We also observed 50 exceptions over 500 time windows. Using the formula above, the test statistic p-value is calculated and equal to 8.08. Compared to 3.84, the critical value of Chi-squared distribution with one degree of freedom at probability 5%, the test statistic is larger. So, the model is rejected. The critical values of Chi-squared can be found by following this link. However, in this Notebook, it's not a good idea to find the corresponding critical value by looking in a "messy" table, especially when we need to change the confidence level. Instead, from p-value, we will calculate the probability of the test statistic in Chi-square thanks to some functions in package scipy. If the calculated probability is smaller than the quantile of loss (e.g, 0.05), the model is rejected and vice versa.

                                      Question 8

                                      Question 8.1

                                      Write a function to calculate the number of failures, that is when the losses (in the original data) exceed the VaR.

                                      HINT

                                      • First, we need to calculate the total loss in each 2-week time interval
                                      • If the total loss of a time interval exceeds VaR, then we say that our VaR fails to estimate the risk in that time interval
                                      • Return the number of failures

                                      NOTE
                                      The loss is often having negative value, so, be careful when compare it to VaR.

                                      In [119]:
                                      def countFailures(stocksReturns, valueAtRisk):
                                          failures = 0
                                          stocksReturns = np.array(stocksReturns)
                                          # iterate over time intervals
                                          for i in range(0, len(stocksReturns[0])):
                                              # calculate the losses in each time interval
                                              loss = np.sum(stocksReturns[:,i])
                                              
                                              # if the loss exceeds VaR
                                              if loss < valueAtRisk:
                                                  failures += 1
                                          return failures
                                      

                                      Question 8.2

                                      Write a function named `kupiecTestStatistic` to calculate the test statistic which was described in the above equation.
                                      In [115]:
                                      def kupiecTestStatistic(total, failures, confidenceLevel):
                                          if (failures == total): 
                                              return 0. #The model is not reasonable
                                          
                                          if (failures == 0):
                                              return 1.
                                          
                                          failureRatio = failures / total
                                          logNumer = (total - failures) * math.log(1 - confidenceLevel) + failures * math.log(confidenceLevel)
                                          logDenom = (total - failures) * math.log(1 - failureRatio) + failures * math.log(failureRatio)
                                          return -2 * (logNumer - logDenom)
                                          
                                      # test the function
                                      assert (round(kupiecTestStatistic(250, 36, 0.1), 2) == 4.80), "function kupiecTestStatistic runs incorrectly"
                                      

                                      Now we can find the p-value accompanying our test statistic value.

                                      In [123]:
                                      def kupiecTestPValue(stocksReturns, valueAtRisk, confidenceLevel):
                                          failures = countFailures(stocksReturns, valueAtRisk)
                                          N = len(stocksReturns)
                                          print("The number of failures:", failures)
                                          total = len(stocksReturns[0])
                                          print("The failures percentage: ", 100 * (failures / total), "%")
                                          testStatistic = kupiecTestStatistic(total, failures, confidenceLevel)
                                          #return 1 - stats.chi2.cdf(testStatistic, 1.0)
                                          return stats.chi2.sf(testStatistic, 1.0)
                                      
                                      varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                      cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                      print("Value at Risk(VaR) 5%:", valueAtRisk)
                                      print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                      print("VaR confidence interval: " , varConfidenceInterval)
                                      print("CVaR confidence interval: " , cvarConfidenceInterval)
                                      print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
                                      
                                      Value at Risk(VaR) 5%: -20.893004969048892
                                      Conditional Value at Risk(CVaR) 5%: -26.324663792435185
                                      VaR confidence interval:  (-21.235011483810702, -20.130960567287893)
                                      CVaR confidence interval:  (-26.99536094124743, -25.55870550470237)
                                      The number of failures: 108
                                      The failures percentage:  8.346213292117465 %
                                      Kupiec test p-value:  4.167169632280711e-07
                                      

                                      Question 8.3

                                      Discuss the results you have obtained
                                      • Through bootstrapping, we got a confidence interval on our VaR statistic of $[-21.24, -20.13]$. Our VaR is $-20.89$; it lies within the confidence interval. On the other hand, we got a confidence interval on our CVaR statistic of $[-27.00, -25.56]$ and our CVaR ($-26.33$) lies within the interval.
                                      • We got a failure precentage of $8.35 \%$ which is quite high.
                                      • We got a tiny Kupiec test p-value ($0.42 \times {10}^{-6}$) which is way below than $0.05$. This means that we can reject the null hypothesis that the model is reasonable.
                                      The results indicates that our model doesn't correspond well to reality and it needs a little improvement.

                                      Question 9

                                      Assume that we invest in more than 100 stocks. Use the same market factors as for the previous questions to estimate VaR by running MCS, then validate your result. What is the main observation you have, once you answer this question? When you plan to invest in more instruments, how is your ability to predict the risk going to be affected?
                                      In [137]:
                                      # We added a parameter 'distribFunc' to easily change our disribution later.
                                      # The parameter 'df' meaning 'degrees of freedom' is useful for some distributions
                                      # The parameter 'plotDistrib' is a boolean that indicates whether we want to plot the compairaison of the factors with their approximations or not.
                                      def invest(numberOfStocks, numTrials = 10000, distribFunc = np.random.multivariate_normal, plotDistrib = False, df=None):
                                          # select path of all stock data files in "stock_folder"
                                          files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
                                      
                                          # assume that we invest only the first 'numberOfStocks' stocks (for faster computation)
                                          files = files[:numberOfStocks]
                                      
                                          # read each line in each file, convert it into the format: (date, value)
                                          rawStocks = [process_stock_file(f) for f in files]
                                      
                                          # select only instruments which have more than 5 years of history
                                          # Note: the number of business days in a year is 260
                                          number_of_years = 5
                                          rawStocks = list(filter(lambda instrument: len(instrument) >= 260 * number_of_years  , rawStocks))
                                      
                                          # note that the data of crude oil and treasury is only available starting from 26/01/2006 
                                          start = datetime(year=2009, month=1, day=23)
                                          end = datetime(year=2014, month=1, day=23)
                                      
                                          # trim into a specific time region
                                          # and fill up the missing values
                                          stocks = list(map(lambda stock:
                                                      fillInHistory(
                                                          trimToRegion(stock, start, end), 
                                                      start, end), 
                                                  rawStocks))
                                      
                                          # merge two factors, trim each factor into a time region
                                          # and fill up the missing values
                                          allfactors = factors1 + factors2
                                          factors = list(map(lambda stock:
                                                      fillInHistory(
                                                          trimToRegion(stock, start, end), 
                                                      start, end), 
                                                      allfactors
                                                      ))
                                      
                                          stocksReturns = list(map(twoWeekReturns, stocks))
                                          factorsReturns = list(map(twoWeekReturns, factors))
                                      
                                          # transpose factorsReturns
                                          factorMat = transpose(factorsReturns)
                                      
                                          # featurize each row of factorMat
                                          factorFeatures = list(map(featurize, factorMat))
                                      
                                          # OLS require parameter is a numpy array
                                          factor_columns = np.array(factorFeatures)
                                      
                                          #add a constant - the intercept term for each instrument i.
                                          factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      
                                          # estimate weights
                                          weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
                                          
                                          factorCov = np.cov(factorsReturns)
                                          factorMeans = [sum(u) / len(u) for u in factorsReturns]
                                          
                                          # Compare the real factor with its approximation
                                          if (plotDistrib):
                                              samples = (
                                                  distribFunc(factorMeans, factorCov, len(factorsReturns[0])).T if (df is None) else distribFunc(factorMeans, factorCov, len(factorsReturns[0]), df=df).T
                                              )
                                              compare(samples[0], factorsReturns[0], "Two-week Crude oil returns distribution")
                                              compare(samples[1], factorsReturns[1], "Two-week Treasury bonds returns distribution")
                                              compare(samples[2], factorsReturns[2], "Two-week GSPC returns distribution")
                                              compare(samples[3], factorsReturns[3], "Two-week IXIC returns distribution")
                                          
                                          parallelism = 12
                                          trial_indexes = list(range(0, parallelism))
                                          seedRDD = sc.parallelize(trial_indexes, parallelism)
                                          bFactorWeights = sc.broadcast(weights)
                                      
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          simulateTrialReturns(
                                                              max(int(numTrials/parallelism), 1), 
                                                              factorMeans, factorCov,
                                                              bFactorWeights.value,
                                                              distribFunc,
                                                              df
                                                          ))
                                          trials.cache()
                                      
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          kupiecTestPValueReturn = kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                      
                                          print("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          print("Kupiec test p-value: " , kupiecTestPValueReturn)
                                          
                                          return kupiecTestPValueReturn
                                      
                                      In [126]:
                                      pValue100 = invest(100)
                                      
                                      The number of failures: 202
                                      The failures percentage:  15.610510046367851 %
                                      Value at Risk(VaR) 5%: -692.7084344116296
                                      Conditional Value at Risk(CVaR) 5%: -1047.7032171427363
                                      VaR confidence interval:  (-723.385547050537, -663.49604966468)
                                      CVaR confidence interval:  (-1111.0626114585991, -996.0807303011258)
                                      Kupiec test p-value:  1.0885893790388677e-45
                                      
                                      COMMENT
                                      According to our model, given the VaR, the portfolio stands only a $5\%$ chance of losing more than $692.71 \ \$$ over two weeks of investment. In addition, given the CVaR, the average loss in the worst $5\%$ of outcomes is $1047.70 \ \$$. On the other hand, this time, our results are worse:
                                      • We got a confidence interval on our VaR statistic of $[-723.39, -663.50]$ and it's larger than before because we have more stocks in our portfolio and consequently the risk is higher than before.
                                      • We got a higher failure precentage of $15.61 \%$.
                                      • We got a smaller Kupiec test p-value ($0.11 \times {10}^{-44}$).
                                      We can see that as we increase the number of stocks, the VaR, the failures increase. We expected a higher p-value but it decreased, this can be due to extreme values in our stock. Anyway, our model is unable to predict risks when investing in more instruments because our model is very simplified (a normal probability distribution to approximate our four factors - we have only four factors but in reality we need more) and even if we increase the number of stocks and get rid of extreme values, predicitons won't improve much.
                                      Out of curiosity, we're going to invest on 2000 stocks:
                                      In [138]:
                                      pValue2000 = invest(2000)
                                      
                                      The number of failures: 151
                                      The failures percentage:  11.669242658423492 %
                                      Value at Risk(VaR) 5%: -1539.124227050596
                                      Conditional Value at Risk(CVaR) 5%: -2046.2205728325878
                                      VaR confidence interval:  (-1588.4676859486665, -1489.0063531897497)
                                      CVaR confidence interval:  (-2116.819018370491, -1991.0269483515906)
                                      Kupiec test p-value:  2.980686240657445e-21
                                      
                                      The VaR and CVaR are higher as expected. However, the number of failures decreased (and thus the failures percentage); this can be explained by the fact that we have a large number of stocks and extreme values have minor effects comparing to the other cases where the number of stocks was small. Futhermore, the Kupiec test p-value increased as well (to $0.30 \times {10}^{-21}$) but it's still way below $0.05$.

                                      Question 10

                                      In the previous questions, we used the normal distributions to sample the factors returns. Try to study how results vary when selecting other probability distributions: our goal is to improve the result of our MCS.
                                      Until now, we used the normal distribution as an approximation to our factors returns. The normal distribution is the large sample distribution in many meaningful statistical problems that involve some version of the Central Limit Theorem. We're going to use distributions that are well adapted to the shape of gaussian curve.
                                      Distribution Multivariate Student distribution
                                      We're going to use the Multivariate Student distribution. It is a continuous probability distribution used when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. Unlike the normal distribution, Students densities are more peaked around the center and have fatter tails.
                                      In [141]:
                                      def multivariate_t_rvs(m, S, n=1, df=np.inf): 
                                          '''generate random variables of multivariate t distribution
                                          Code soure: https://stackoverflow.com/questions/41957633/sample-from-a-multivariate-t-distribution-python
                                          Parameters
                                          ----------
                                          m : array_like
                                              mean of random variable, length determines dimension of random variable
                                          S : array_like
                                              square array of covariance  matrix
                                          df : int or float
                                              degrees of freedom
                                          n : int
                                              number of observations, return random array will be (n, len(m))
                                          Returns
                                          -------
                                          rvs : ndarray, (n, len(m))
                                              each row is an independent draw of a multivariate t distributed
                                              random variable
                                          '''
                                          m = np.asarray(m)
                                          d = len(m)
                                          if df == np.inf:
                                              x = 1.
                                          else:
                                              x = np.random.chisquare(df, n)/df
                                          z = np.random.multivariate_normal(np.zeros(d),S,(n,))
                                          return m + z/np.sqrt(x)[:,None]   # same output format as random.multivariate_normal
                                      
                                      In [144]:
                                      pValueTRVS35 = invest(35, distribFunc = multivariate_t_rvs, plotDistrib = True, df = 4)
                                      
                                      ---- Two-week Crude oil returns distribution ----
                                      Mean Squared Error: 0.0001793478735773678
                                      Relative Entropy: 0.08699756806474758
                                      
                                      ---- Two-week Treasury bonds returns distribution ----
                                      Mean Squared Error: 0.23874332313273455
                                      Relative Entropy: 0.11608696894532546
                                      
                                      ---- Two-week GSPC returns distribution ----
                                      Mean Squared Error: 6.843997890618029e-06
                                      Relative Entropy: 0.24221470445204557
                                      
                                      ---- Two-week IXIC returns distribution ----
                                      Mean Squared Error: 1.0391882116516024e-06
                                      Relative Entropy: 0.18299458373434482
                                      
                                      The number of failures: 51
                                      The failures percentage:  3.9412673879443583 %
                                      Value at Risk(VaR) 5%: -28.27217027374513
                                      Conditional Value at Risk(CVaR) 5%: -47.71285560732761
                                      VaR confidence interval:  (-29.199519702249493, -27.312448614720868)
                                      CVaR confidence interval:  (-53.38582211507419, -44.44369306849102)
                                      Kupiec test p-value:  0.07001415820706638
                                      
                                      COMMENT
                                      The error between the factors and their approximations is low, even though in some cases it's higher than the normal distribution. Further, it improved for the last two factors (GSPC and IXIC) where the error is smaller. Speaking of the relative entropy (Kullback–Leibler divergence), we had a relative entropy higher than 0.5 for the third factor (GSPC). With the Student distribution, the entropy decreased very much ($0.24$ and $0.18$ for the two last factors respectively) which means we don't have that loss of information anymore. This allowed us to get a better results concerning the failure percentage and p-value: The failure percentage has decreased from $8.34$ to $3.94$, and the p-value that was so small, is higher now than $0.05$ which means that we maintain the null hypothesis that our model is reasonable.

                                      On the other hand, given the VaR, the portfolio stands only a $5\%$ chance of losing more than $28.27 \ \$$ over two weeks of investment. In addition, given the CVaR, the average loss in the worst $5\%$ of outcomes is $47.71 \ \$$. Furthermore, through bootstrapping, we got a confidence interval on our VaR statistic of $[-29.20, -27.31]$. Our VaR lies within the confidence interval. On the other hand, we got a confidence interval on our CVaR statistic of $[-53.39, -44.44]$ and our CVaR, as well, lies within the interval. The results indicates that our model improved and it's closer to reality.
                                      Distribution Multivariate log-normal
                                      We're going to use the Multivariate log-normal. It is a continuous probability distribution of a random variable whose logarithm is normally distributed. If ${\boldsymbol {X}}\sim {\mathcal {N}}({\boldsymbol {\mu }},\,{\boldsymbol {\Sigma }})$ is a multivariate normal distribution then ${\boldsymbol {Y}}=\exp({\boldsymbol {X}})$ has a multivariate log-normal distribution with mean $\operatorname {E} [{\boldsymbol {Y}}]_{i}=e^{\mu _{i}+{\frac {1}{2}}\Sigma _{ii}}$ and covariance matrix $\operatorname {Var} [{\boldsymbol {Y}}]_{ij}=e^{\mu _{i}+\mu _{j}+{\frac {1}{2}}(\Sigma _{ii}+\Sigma _{jj})}(e^{\Sigma _{ij}}-1)$.

                                      Therefore, in order to implement easily the Multivariate log-normal distribution with the mean $\mu _{i}$ and the covariance $\Sigma$, we're going to use the Multivariate normal distribution ${\mathcal {N}}({\boldsymbol {\mu '}},\,{\boldsymbol {\Sigma '}})$ where $\mu ' _{i} = \log(\mu _{i}) - {\frac {1}{2}} \Sigma ' _{ii}$ and $\Sigma ' _{ij} = \log({\frac {\Sigma _{ij}}{\mu _{i} \mu _{j}} } + 1)$.
                                      In [147]:
                                      def multivariate_lognormal(mean, cov, n=1):
                                          mean_normal = np.zeros(len(mean))
                                          cov_normal = np.zeros((len(cov), len(cov[0])))
                                          
                                          for i in range(cov_normal.shape[0]):
                                              for j in range(cov_normal.shape[1]):
                                                  if (mean[i] * mean[j] != 0):
                                                      cov_normal[i][j] = math.log(1 + cov[i][j] / (mean[i] * mean[j]))
                                          
                                          for i in range(mean_normal.shape[0]):
                                              if (mean[i] > 0):
                                                  mean_normal[i] = math.log(mean[i]) - cov_normal[i][i] / 2
                                                  
                                          return np.random.multivariate_normal(mean_normal, cov_normal, n)
                                      
                                      In [148]:
                                      pValueLogNormal35 = invest(35, distribFunc = multivariate_lognormal, plotDistrib = True)
                                      
                                      /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:14: RuntimeWarning: covariance is not positive-semidefinite.
                                        
                                      
                                      ---- Two-week Crude oil returns distribution ----
                                      Mean Squared Error: 0.003423439850482823
                                      Relative Entropy: 0.13612050164155287
                                      
                                      ---- Two-week Treasury bonds returns distribution ----
                                      Mean Squared Error: 1.1543133130165826
                                      Relative Entropy: 0.1747861463821812
                                      
                                      ---- Two-week GSPC returns distribution ----
                                      Mean Squared Error: 0.014808723027361265
                                      Relative Entropy: 0.3364080453923412
                                      
                                      ---- Two-week IXIC returns distribution ----
                                      Mean Squared Error: 0.015147104932148197
                                      Relative Entropy: 0.4429337855861849
                                      
                                      The number of failures: 0
                                      The failures percentage:  0.0 %
                                      Value at Risk(VaR) 5%: -19778.67869801265
                                      Conditional Value at Risk(CVaR) 5%: -23059.104029496262
                                      VaR confidence interval:  (-20128.239597140702, -19390.82994912339)
                                      CVaR confidence interval:  (-23455.13703567505, -22666.41375887778)
                                      Kupiec test p-value:  0.31731050786291115
                                      
                                      COMMENT
                                      Even though we got a number of failures equal to $ 0 $, our model is very far from reality. In fact, if we look at the plots comparing the real factor and its log-normal approximation, we can see that the approximation doesn't fit at all the real factor. In addition, the Q-Q plot proves that the approximation is not at all similar to our factors. So it is hard to use log-normal to represent this market factors.
                                      Distribution Gauxian Mixture Model (GMM)
                                      We are going to use now GMM. General Mixture models (GMMs) are an unsupervised probabilistic model composed of multiple distributions (commonly referred to as components) and corresponding weights allowing to model more complex distributions corresponding to a singular underlying phenomena.
                                      In [149]:
                                      def simulateTrialReturnsGMM(numTrials, factorsReturns, weights, n_components=1, covariance_type='tied', tol=0.001):
                                          trialReturns = []
                                          for i in range(0, numTrials):
                                              # generate sample of factors' returns
                                              
                                              models = GaussianMixture(n_components=n_components, covariance_type=covariance_type, tol=tol).fit(np.array(factorsReturns).T)
                                              trialFactorReturns = models.sample(1)[0].reshape(len(factorsReturns))
                                              
                                              # featurize the factors' returns
                                              trialFeatures = featurize(np.matrix.tolist(trialFactorReturns))
                                              
                                              # insert weight for intercept term
                                              trialFeatures.insert(0,1)
                                              
                                              trialTotalReturn = 0
                                              
                                              # calculate the return of each instrument
                                              # then calulate the total of return for this trial features
                                              trialTotalReturn = np.sum(np.dot(weights, trialFeatures))
                                              
                                              trialReturns.append(trialTotalReturn)
                                          return trialReturns
                                      
                                      def investGMM(numberOfStocks, numTrials = 10000, n_components=1, covariance_type='tied', tol=0.001):
                                          # select path of all stock data files in "stock_folder"
                                          files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
                                      
                                          # assume that we invest only the first 'numberOfStocks' stocks (for faster computation)
                                          files = files[:numberOfStocks]
                                      
                                          # read each line in each file, convert it into the format: (date, value)
                                          rawStocks = [process_stock_file(f) for f in files]
                                      
                                          # select only instruments which have more than 5 years of history
                                          # Note: the number of business days in a year is 260
                                          number_of_years = 5
                                          rawStocks = list(filter(lambda instrument: len(instrument) >= 260 * number_of_years  , rawStocks))
                                      
                                          # note that the data of crude oil and treasury is only available starting from 26/01/2006 
                                          start = datetime(year=2009, month=1, day=23)
                                          end = datetime(year=2014, month=1, day=23)
                                      
                                          # trim into a specific time region
                                          # and fill up the missing values
                                          stocks = list(map(lambda stock:
                                                      fillInHistory(
                                                          trimToRegion(stock, start, end), 
                                                      start, end), 
                                                  rawStocks))
                                      
                                          # merge two factors, trim each factor into a time region
                                          # and fill up the missing values
                                          allfactors = factors1 + factors2
                                          factors = list(map(lambda stock:
                                                      fillInHistory(
                                                          trimToRegion(stock, start, end), 
                                                      start, end), 
                                                      allfactors
                                                      ))
                                      
                                          stocksReturns = list(map(twoWeekReturns, stocks))
                                          factorsReturns = list(map(twoWeekReturns, factors))
                                      
                                          # transpose factorsReturns
                                          factorMat = transpose(factorsReturns)
                                      
                                          # featurize each row of factorMat
                                          factorFeatures = list(map(featurize, factorMat))
                                      
                                          # OLS require parameter is a numpy array
                                          factor_columns = np.array(factorFeatures)
                                      
                                          #add a constant - the intercept term for each instrument i.
                                          factor_columns = sm.add_constant(factor_columns, prepend=True)
                                      
                                          # estimate weights
                                          weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
                                          
                                          models = GaussianMixture(n_components=n_components, covariance_type=covariance_type, tol=tol).fit(np.array(factorsReturns).T)
                                          samples = models.sample(len(factorsReturns[0]))[0].T
                                          
                                          compare(samples[0], factorsReturns[0], "Two-week Crude oil returns distribution")
                                          compare(samples[1], factorsReturns[1], "Two-week Treasury bonds returns distribution")
                                          compare(samples[2], factorsReturns[2], "Two-week GSPC returns distribution")
                                          compare(samples[3], factorsReturns[3], "Two-week IXIC returns distribution")
                                          
                                          parallelism = 12
                                          trial_indexes = list(range(0, parallelism))
                                          seedRDD = sc.parallelize(trial_indexes, parallelism)
                                          bFactorWeights = sc.broadcast(weights)
                                          bfactorsReturns = sc.broadcast(factorsReturns)
                                      
                                          trials = seedRDD.flatMap(lambda idx: \
                                                          simulateTrialReturnsGMM(
                                                              max(int(numTrials/parallelism), 1), 
                                                              bfactorsReturns.value,
                                                              bFactorWeights.value,
                                                              n_components,
                                                              covariance_type,
                                                              tol
                                                          ))
                                          trials.cache()
                                      
                                          valueAtRisk = fivePercentVaR(trials)
                                          conditionalValueAtRisk = fivePercentCVaR(trials)
                                      
                                          varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
                                          cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
                                          kupiecTestPValueReturn = kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
                                      
                                          print("Value at Risk(VaR) 5%:", valueAtRisk)
                                          print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
                                          print("VaR confidence interval: " , varConfidenceInterval)
                                          print("CVaR confidence interval: " , cvarConfidenceInterval)
                                          print("Kupiec test p-value: " , kupiecTestPValueReturn)
                                          
                                          return kupiecTestPValueReturn
                                      
                                      In [150]:
                                      pValueGMM35 = investGMM(35, n_components=2, covariance_type='full', tol=1e-3)
                                      
                                      ---- Two-week Crude oil returns distribution ----
                                      Mean Squared Error: 0.0003185135412432679
                                      Relative Entropy: 0.17526012534294397
                                      
                                      ---- Two-week Treasury bonds returns distribution ----
                                      Mean Squared Error: 0.22983729523550583
                                      Relative Entropy: 0.15232729912918866
                                      
                                      ---- Two-week GSPC returns distribution ----
                                      Mean Squared Error: 2.732129834923721e-07
                                      Relative Entropy: 0.01140952358337333
                                      
                                      ---- Two-week IXIC returns distribution ----
                                      Mean Squared Error: 2.0640384528769058e-07
                                      Relative Entropy: 0.04016663719202814
                                      
                                      The number of failures: 84
                                      The failures percentage:  6.491499227202473 %
                                      Value at Risk(VaR) 5%: -23.305322560116103
                                      Conditional Value at Risk(CVaR) 5%: -32.44720704172719
                                      VaR confidence interval:  (-24.578164611470594, -22.18107162691827)
                                      CVaR confidence interval:  (-33.80630405793223, -31.253164361452868)
                                      Kupiec test p-value:  0.018354973938239805
                                      
                                      COMMENT
                                      With the Gaussian Mixture Model, the approximations are very close to the real factors. The error is low and so is the relative entropy, which means that the loss of information is unlikely to happen.

                                      The Q-Q plots above shows that our real factors and their approximation is very similar at the mean. The differences in the tails are quite insignificant. Our approximation is good at tail matching. Unfortunately, the p-value is lower than $0.05$ which implies that our model does not represent well the reality.

                                      On the other hand, given the VaR, the portfolio stands only a $5\%$ chance of losing more than $23.30 \ \$$ over two weeks of investment. In addition, given the CVaR, the average loss in the worst $5\%$ of outcomes is $32.45 \ \$$. Furthermore, through bootstrapping, we got a confidence interval on our VaR statistic of $[-24.58, -22.18]$. Our VaR lies within the confidence interval. On the other hand, we got a confidence interval on our CVaR statistic of $[-33.81, -31.25]$ and our CVaR, as well, lies within the interval.

                                      6. Summary

                                      In this lecture, we studied the Monte Carlo Simulation method and its application to estimate financial risk. To apply it, first, we needed to define the relationship between market factors and the instruments' returns. In such step, you must define the model which maps the market factors' values to the instruments' values: in our use case, we used a linear regression function for building our model. Next, we also had to find the parameters of our model, which are the weights of the factors we considered. Then, we had to study the distribution of each market factor. A good way to do that is using Kernel density estimation to smooth the distribution and plot it. Depending on the shape of each figure, we had to guess the best fit distribution for each factor: in our use case, we used a very simple approach, and decided that our smoothed distributions all looked normal distributions.

                                      Then, the idea of Monte Carlo simulation was to generate many possible values for each factor and calculate the corresponding outcomes by a well-defined model in each trial. After many trials, we were able to calculate VaR from the sequences of outcome's values. When the number of trials is large enough, the VaR converges to reasonable values, that we could validate using well-known statistical hypothesis.

                                      References